HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746
In this R notebook, we process the raw textual data for our data analysis.
From the packages’ descriptions:
tm is a framework for text mining applications within R.
tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
tidytext allows text mining using ‘dplyr’, ‘ggplot2’, and other tidy tools.
DT provides an R interface to the JavaScript library DataTables.
library(tm)
library(tidytext)
library(tidyverse)
library(DT)
urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)
We clean the text by converting all the letters to lowercase and removing punctuation, numbers, and extra white space. (The removeWords step below is given an empty word list, so it removes nothing here; stop words are handled later with tidytext.)
corpus <- VCorpus(VectorSource(hm_data$cleaned_hm))%>%
tm_map(content_transformer(tolower))%>%
tm_map(removePunctuation)%>%
tm_map(removeNumbers)%>%
tm_map(removeWords, character(0))%>%
tm_map(stripWhitespace)
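As a minimal sketch of what the tm pipeline above does to each document, here are the same cleaning steps applied to a single made-up string with base R:

```r
# Toy illustration of the cleaning steps: lowercase, strip punctuation,
# strip numbers, collapse extra white space
x <- "I ate 2 Donuts,   and I was happy!"
x <- tolower(x)                      # lowercase
x <- gsub("[[:punct:]]", "", x)      # remove punctuation
x <- gsub("[[:digit:]]", "", x)      # remove numbers
x <- gsub("\\s+", " ", trimws(x))    # collapse extra white space
x
```

The tm pipeline applies these same transformations to every happy moment in the corpus.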
Stemming reduces a word to its word stem. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.
stemmed <- tm_map(corpus, stemDocument) %>%
tidy() %>%
select(text)
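stemDocument applies the Porter stemmer (via the SnowballC package). As a rough, naive illustration only of what stemming does, one can imagine stripping common suffixes; real Porter stemming is considerably more involved:

```r
# Naive suffix-stripping sketch -- NOT the actual Porter algorithm
crude_stem <- function(w) sub("(ing|ed|s)$", "", w)
crude_stem(c("walking", "walked", "walks"))
```

All three inflections reduce to the same stem "walk", which is the point of the step: variants of a word are counted together.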
We also need a dictionary to look up the words corresponding to the stems.
dict <- tidy(corpus) %>%
select(text) %>%
unnest_tokens(dictionary, text)
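unnest_tokens puts one token per row, so the dictionary lines up word-for-word with the stemmed tokens. A base-R sketch of the same idea, on a made-up two-document example:

```r
# unnest_tokens-style tokenization: one word per element, documents flattened
docs <- c("found my lost keys", "keys found")
tokens <- unlist(strsplit(docs, "\\s+"))
tokens
```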
We remove stop words provided by the “tidytext” package and also add custom stop words in the context of our data.
data("stop_words")
word <- c("happy","ago","yesterday","lot","today","months","month",
"happier","happiest","last","week","past")
stop_words <- stop_words %>%
bind_rows(mutate(tibble(word), lexicon = "updated"))
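The effect of extending the lexicon and then anti-joining against it can be sketched in base R with made-up tokens and a tiny stop-word table:

```r
# Append custom stop words to a lexicon, then drop matching tokens
stops  <- data.frame(word = c("the", "a"), lexicon = "base")
custom <- data.frame(word = c("happy", "today"), lexicon = "updated")
stops  <- rbind(stops, custom)           # like bind_rows() above
tokens <- c("happy", "dog", "today", "park")
kept   <- tokens[!tokens %in% stops$word]  # like anti_join() on word
kept
```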
Here we combine the stems and the dictionary into the same “tidy” object.
completed <- stemmed %>%
mutate(id = row_number()) %>%
unnest_tokens(stems, text) %>%
bind_cols(dict) %>%
anti_join(stop_words, by = c("dictionary" = "word"))
Lastly, we complete the stems by picking the corresponding word with the highest frequency.
completed <- completed %>%
group_by(stems) %>%
count(dictionary) %>%
mutate(word = dictionary[which.max(n)]) %>%
ungroup() %>%
select(stems, word) %>%
distinct() %>%
right_join(completed) %>%
select(-stems)
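The “highest frequency” rule can be checked on a small made-up stem/word table with base R: for each stem, we tabulate its observed completions and keep the most common one.

```r
# For each stem, pick the completion that occurs most often
d <- data.frame(stems      = c("celebr", "celebr", "celebr"),
                dictionary = c("celebration", "celebrate", "celebration"))
pick <- tapply(d$dictionary, d$stems,
               function(w) names(which.max(table(w))))
pick[["celebr"]]
```

Here "celebration" occurs twice and "celebrate" once, so every occurrence of the stem "celebr" is completed as "celebration".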
We want our processed words to resemble the structure of the original happy moments. So we paste the words together to form happy moments.
completed <- completed %>%
group_by(id) %>%
summarise(text = str_c(word, collapse = " ")) %>%
ungroup()
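Collapsing the one-word-per-row table back into one string per moment can be sketched in base R (str_c(word, collapse = " ") behaves like paste with collapse), using a hypothetical two-moment example:

```r
# Rebuild one text string per id from individual words
words <- data.frame(id   = c(1, 1, 1, 2, 2),
                    word = c("found", "lost", "keys", "enjoyed", "dinner"))
moments <- tapply(words$word, words$id, paste, collapse = " ")
moments[["1"]]
```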
hm_data <- hm_data %>%
mutate(id = row_number()) %>%
inner_join(completed, by = "id")
datatable(hm_data)